This repository was archived by the owner on Oct 11, 2024. It is now read-only.
[1/N] Rs/vllm quantization - Refactor to minimize llama.py changes #186

Merged: varun-sundar-rabindranath merged 12 commits into `vllm-quantization` on Apr 16, 2024
Conversation
added 10 commits on April 12, 2024 at 20:30, including:

- … to remove changes to llama.py
vllm/model_executor/layers/linear.py (outdated)

```diff
  output_size_per_partition: int, input_size: int,
  output_size: int,
- params_dtype: torch.dtype) -> Dict[str, Any]:
+ params_dtype: torch.dtype, logical_widths: Optional[List[int]]) -> Dict[str, Any]:
```
Lift this to be inside `LinearMethodBase`?
Collaborator (Author):
I got rid of this in the next PR.
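To make the `logical_widths` argument concrete: for a fused QKV projection, the output dimension is the concatenation of the Q, K, and V widths, so per-shard metadata (e.g. quantization scales) can be allocated per logical shard rather than per fused tensor. A minimal sketch, with a hypothetical helper name (`create_shard_offsets` is not part of the actual vLLM API):

```python
# Hypothetical sketch of how `logical_widths` could describe a fused layer.
# Each logical shard gets a (start, end) offset into the fused output dim.

def create_shard_offsets(output_size_per_partition, logical_widths=None):
    """Return one (start, end) offset per logical shard."""
    if logical_widths is None:
        # Unfused layer: a single logical shard spans the whole output.
        logical_widths = [output_size_per_partition]
    assert sum(logical_widths) == output_size_per_partition
    offsets = []
    start = 0
    for width in logical_widths:
        offsets.append((start, start + width))
        start += width
    return offsets

# A fused QKV layer with width 8 for Q and 4 each for K and V:
print(create_shard_offsets(16, [8, 4, 4]))  # [(0, 8), (8, 12), (12, 16)]
```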
```diff
@@ -1,8 +1,9 @@
 from typing import Any, Dict, List, Tuple, Optional
```
vllm/model_executor/model_loader.py (outdated)

```diff
  if _is_support_smoothquant(model_config):
-     model = model_class(model_config.hf_config, linear_method,
-                         quant_config)
+     model = model_class(model_config.hf_config, linear_method)
```
How come we don't have to pass in the `quant_config`? Because the `LinearMethod` already knows if it is quantized?
Collaborator (Author):
Yeah, the linear method handles it.
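The point being made here can be sketched in a few lines: the quantization config is captured by the linear method at construction time, so the model constructor only needs the linear method itself. Class and function names below are illustrative, not the actual vLLM API:

```python
# Hypothetical sketch: the config travels inside the linear method,
# so the model never needs a separate quant_config argument.

class QuantConfig:
    def __init__(self, per_token):
        self.per_token = per_token

class QuantLinearMethod:
    def __init__(self, quant_config):
        # The config is captured here, once, during get_model.
        self.quant_config = quant_config

def build_model(linear_method):
    # The model constructor only sees the linear method.
    return {"linear_method": linear_method}

cfg = QuantConfig(per_token=True)
model = build_model(QuantLinearMethod(cfg))
print(model["linear_method"].quant_config.per_token)  # True
```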
LGTM.
varun-sundar-rabindranath approved these changes on Apr 16, 2024
… via config (#188)

Refactored to support nonuniform quantization by adding a new layer of abstraction. Now, `SmoothQuantLinearMethod` can hold a `SmoothQuantFormat`, which implements the details of how to do quant and dequant operations. There are two `SmoothQuantFormat` classes:

- `SmoothQuantDynamicPerToken`
- `SmoothQuantStaticPerTensor`

We have the following lifecycle:

- `LinearMethod` is created during `get_model` and has access to `QuantizationConfig`
- `Layer` is initialized and passed a `LinearMethod`
- `Layer` calls `LinearMethod.create_weights`, which creates a dictionary of weights and metadata
- `Layer` calls `LinearMethod.apply_weights` during inference, passing the dictionary created during `create_weights`

This PR modifies the `LinearMethod.create_weights` API to receive a `layer_name` as an argument. The `LinearMethod` then looks in the `config` to determine which `SmoothQuantFormat` to use for the layer with `layer_name`. As a result, the `LinearMethod` is responsible for parsing the config from disk and making decisions about what the inference format should look like. In this specific case, since the `SmoothQuantConfig` is not very good, we just match on the suffix `qkv` to determine what each layer should use, but for `SparseMLConfig` we could use a similar structure.

In this PR, the `SmoothQuantFormat` is passed in the dictionary returned by `create_weights` and then is used by `apply_weights`.

### In Summary

I think this is a good overall structure because it:

- (a) allows us to make minimal changes to the existing models
- (b) allows us to make no changes to the model loading lifecycle (i.e. config / constructor / linear method); this critically requires having one `LinearMethod` that propagates through the whole model
- (c) encapsulates the nonuniform logic into the `LinearMethod`, allowing us to have a clean interface

### For SparseML Models

We could imagine the following architecture:

#### Config

Config is responsible for:

- loading the config from disk
- mapping layer names to `SparseMLFormat`s

```python
class SparseMLConfig:
    def from_dict(self):
        ...

    def get_layer_format(self, layer_name):
        return SparseMLFormat
```

#### LinearMethod

LinearMethod is responsible for:

- the interface between layers and kernels (so `LinearMethod` is what is used by the model)

```python
class SparseMLLinearMethod:
    def __init__(self, sparseml_config):
        self.sparseml_config = sparseml_config

    def create_weights(self, layer_name, ...):
        # This, e.g., is where nonuniform quantization might be supported.
        format = self.sparseml_config.get_layer_format(layer_name)
        weights = format.get_weights()
        weights["format"] = format
        return weights

    # Wrapper around the SparseML format.
    def apply_weights(self, x, weights, ...):
        format = weights["format"]
        weights = weights["weights"]
        return format.apply_weights(x, weights)
```

#### SparseMLFormat

Format is responsible for:

- actual weight creation and the forward pass

```python
class SparseMLFormat:
    def get_weights(self, sizes):
        # Returns a dictionary, e.g.:
        return {
            "weights": x,
            "scales": y,
        }

    def apply_weights(self, weights, x):
        # Calls the CUDA kernel.
        return output
```

Sample formats:

- `W8A8DynamicPerToken`
- `SparseW8A8StaticPerTensorAsymmetric`
- `W4A8DynamicPerToken`
- ...
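The lifecycle described above can be condensed into a runnable sketch. Class names mirror the description but are heavily simplified, and which format maps to the `qkv` suffix is an assumption for illustration; the real `create_weights`/`apply_weights` operate on tensors and call quantized kernels:

```python
# Minimal sketch of layer_name -> SmoothQuantFormat dispatch.

class SmoothQuantDynamicPerToken:
    name = "dynamic_per_token"

class SmoothQuantStaticPerTensor:
    name = "static_per_tensor"

class SmoothQuantConfig:
    def get_layer_format(self, layer_name):
        # Match on the `qkv` suffix, as described in the PR text.
        # (Which format goes with which suffix is assumed here.)
        if layer_name.endswith("qkv"):
            return SmoothQuantStaticPerTensor()
        return SmoothQuantDynamicPerToken()

class SmoothQuantLinearMethod:
    def __init__(self, config):
        self.config = config  # parsed from disk in the real code

    def create_weights(self, layer_name):
        # The format rides along in the returned weights dict.
        fmt = self.config.get_layer_format(layer_name)
        return {"weights": None, "format": fmt}

    def apply_weights(self, weights, x):
        # Stand-in for the real kernel dispatch via the format object.
        return weights["format"].name

method = SmoothQuantLinearMethod(SmoothQuantConfig())
qkv_weights = method.create_weights("model.layers.0.qkv")
print(method.apply_weights(qkv_weights, x=None))  # static_per_tensor
```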
Paired with @dsikka to refactor `SmoothQuantLinearMethod` to avoid making changes to `llama.py`, by making the indexing (splitting QKV into logical shards) generic and explicitly handling `state_dict` conversion.

Many todos left, including:

- `use_per_token`: need to use the quant config for this
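For context on what `use_per_token` selects between, here is a hedged pure-Python sketch of the two scale schemes named above (`SmoothQuantDynamicPerToken` vs. `SmoothQuantStaticPerTensor`); the function names are illustrative, and real implementations compute these over tensors:

```python
# Dynamic per-token quantization derives one scale per row (token) of the
# activations at runtime; static per-tensor uses a single precomputed scale.

def per_token_scales(x, qmax=127):
    # One scale per row, from that row's absolute maximum (int8 range).
    return [max(abs(v) for v in row) / qmax for row in x]

def per_tensor_scale(x, qmax=127):
    # A single scale for the whole tensor.
    return max(abs(v) for row in x for v in row) / qmax

acts = [[1.0, -2.0], [0.5, 0.25]]
print(per_token_scales(acts))  # [2.0/127, 0.5/127]
print(per_tensor_scale(acts))  # 2.0/127
```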